Coding for DS and DM
R coding module

Lecture 7

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Random number generators

  • We have already encountered the set.seed function
set.seed(1)
rnorm(10)
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884
  • It sets the state of R's random number generator: calling it again with the same seed reproduces exactly the same draws
set.seed(1)
rnorm(10)
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884
  • Essential for reproducibility whenever randomness comes into play!

Random number generators

  • In statistics, randomness is vital.
  • Random values from random variables are fundamental, especially in simulation methods.
  • Computational statistics is based (also) on random numbers and numerical simulation.
  • Two definitions of random numbers:
    • Classic;
    • Modern

Random number generators (cont’d)

  • In the classic definition, a random number is ideally a draw from a continuous Uniform distribution on the interval (0,1):

\[ X \sim U(0,1) \]

  • So, the definition is:

    A sequence of random numbers is formed by real numbers in the interval (0,1) generated in sequence, independently and with the same probability.

Random number generators (cont’d)

  • Historically, traditional generators involved mechanical devices used to generate random numbers:
    • Lotto games
    • Dice
    • Coins
    • Cards
    • Roulette, etc.

Random number generators (cont’d)

  • Pseudo-random number (PRN) generators:
    • The term ‘pseudo’ indicates that the generated values are not truly random (they should be called fake random numbers).
    • PRNs are the result of an ordered and finite series of deterministic operations.
    • These algorithms are often recursive, sometimes causing autocorrelation in the series of values produced, contrary to the desired independence.

Random number generators (cont’d)

  • Characteristics of good pseudo RN generators:
    • They should produce numbers that appear to be uniformly distributed in the interval (0, 1), with minimal dependence structure.
    • They must be fast enough for simulation purposes.
    • They should have modest memory requirements.
    • They must ensure reproducibility of the generated series (to validate simulation results across different conditions).
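
The first and last properties above can be checked informally on R's default generator. The following is a minimal sketch, not a formal test battery: the seed, the sample size and the choice of checks (moments, a Kolmogorov-Smirnov test, the lag-1 autocorrelation) are illustrative assumptions.

```r
set.seed(123)              # fix the RNG state so the checks are reproducible
u <- runif(10000)          # 10,000 draws from R's default generator

# Uniformity: values should lie in (0, 1), with mean near 1/2 and variance near 1/12
range_ok <- all(u > 0 & u < 1)
moments <- c(mean = mean(u), var = var(u))

# Kolmogorov-Smirnov test against U(0, 1): a large p-value is compatible with uniformity
ks <- ks.test(u, "punif")

# Dependence: the lag-1 sample autocorrelation should be close to 0
lag1 <- cor(u[-1], u[-length(u)])

# Reproducibility: resetting the seed gives back the very same series
set.seed(123)
same_series <- identical(u, runif(10000))
```

Plotting hist(u) and acf(u) gives a visual version of the same checks.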

Random number generators (cont’d)

  • Modern RNG methods are often based on recursive algebraic algorithms: starting from a seed (an arbitrarily chosen value or vector of values), they generate each new value from the previously generated ones.
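  • Congruential generators are based on the modulus operator. In the linear congruential form, with multiplier \(a\), increment \(c\) and modulus \(m\), the values \(X_n / m\) play the role of uniform numbers:

\[ X_{n+1} = (a X_n + c) \bmod m \]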

  • Not very interesting for statisticians, if you wish to know more look here
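
For the curious, a linear congruential generator with recursion \(X_{n+1} = (a X_n + c) \bmod m\) can be sketched in a few lines of R. The function name lcg and its arguments are hypothetical, and the default constants (a = 16807, c = 0, m = 2^31 - 1, the classic "minimal standard" choice) are an assumption made for illustration: this is not the generator R uses by default.

```r
# Linear congruential generator: x_{n+1} = (mult * x_n + incr) mod modulus
# Returns n values x_n / modulus, to be read as pseudo-random numbers in (0, 1)
lcg <- function(n, seed, mult = 16807, incr = 0, modulus = 2^31 - 1) {
  out <- numeric(n)
  state <- seed                       # with incr = 0, use a seed in 1, ..., modulus - 1
  for (i in seq_len(n)) {
    # mult * state stays below 2^53, so the arithmetic is exact in double precision
    state <- (mult * state + incr) %% modulus
    out[i] <- state / modulus
  }
  out
}

u1 <- lcg(5, seed = 42)
u2 <- lcg(5, seed = 42)               # same seed, same series: the procedure is deterministic
```

Each value is a deterministic function of the previous one, which is exactly why such numbers are called pseudo-random.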

A simple and motivating example

  • Random Number Generation has a plethora of (ever-evolving!) applications
  • Let us consider the following problem: how to compute the irrational number \(\pi\)?
  • From primary school, you should remember that for a circle you have that

\[ A=\pi r^2 \] where \(A\) is the area of the circle and \(r\) its radius

A simple and motivating example

  • We now consider the following (random) procedure.
    1. You throw nsim grains of sand at random within a square of side \(l=2\), so that the inscribed circle has radius \(r = l/2 = 1\).
    2. You count how many fall within the circle inscribed in the square.
    3. The proportion of grains falling within the circle out of the total is an approximation of the ratio between the areas of circle and square: \[ \text{proportion of grains in the circle} \approx \frac{A_{circle}}{A_{square}} = \frac{\pi r^2}{l^2} \]

A simple and motivating example

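  4. You can compute \(\pi\) by simply inverting the formula! With \(l=2\) and \(r=1\): \[ \pi \approx \text{proportion of grains in the circle} \times \frac{l^2}{r^2} = 4 \times \text{proportion of grains in the circle} \]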
  • Let us code up this simple random procedure in R!
  • Check the approximating_pi_with_sand.R script
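
A minimal sketch of the procedure, for a square of side \(l=2\) centred at the origin with inscribed circle of radius \(r=1\), so that \(\pi \approx 4 \times\) the proportion of grains in the circle. This is not necessarily what the course script does: the seed, the value of nsim and the variable names are assumptions.

```r
set.seed(1)                      # for reproducibility
nsim <- 100000                   # number of grains of sand
l <- 2                           # side of the square
r <- l / 2                       # radius of the inscribed circle

# Throw the grains uniformly at random over the square [-r, r] x [-r, r]
x <- runif(nsim, min = -r, max = r)
y <- runif(nsim, min = -r, max = r)

# A grain falls within the circle when its distance from the origin is at most r
inside <- x^2 + y^2 <= r^2

# Proportion in the circle approximates (pi * r^2) / l^2; invert to get pi
prop_inside <- mean(inside)
pi_hat <- prop_inside * l^2 / r^2
```

Increasing nsim makes pi_hat more accurate, with an error that shrinks roughly like \(1/\sqrt{nsim}\).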

Now your turn!

  • Write an R function to generate random numbers from a standard Normal distribution using the Box-Muller transform
  • Wikipedia provides a good reference
  • You have 10 minutes from now!

Random variable generators

  • Random Variable Generation (RVG) relies on producing (via computation) a supposedly endless flow of iid values from specific random variables
  • So far, we have been concerned with generating values from the uniform distribution \(U(a,b)\).
  • We will leverage certain characteristics of the uniform distribution to generate other random variables.
  • Starting from simulated uniform random variables, we will explore a basic methodology that can generate random variables from other distributions, provided we know how to write the quantile function

The inverse transform algorithm

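  • Let \(F(x)\) be the cumulative distribution function (cdf) of a random variable, and let \(F^{-1}(u) = \inf\{x : F(x) \geq u\}\) be its quantile function. Define \(X = F^{-1}(U)\), where \(U\) is a continuous uniform random variable in the interval \((0,1)\). Then \(X\) is distributed as \(F\): since \(F^{-1}(U) \leq x\) if and only if \(U \leq F(x)\), we have \(P(X \leq x) = P(U \leq F(x)) = F(x)\).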
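
As an illustration, take the Exponential distribution with rate \(\lambda\): its cdf is \(F(x) = 1 - e^{-\lambda x}\), so the quantile function is \(F^{-1}(u) = -\log(1-u)/\lambda\). The sketch below applies the algorithm to this case. The function name rexp_inverse is hypothetical, and the distribution, seed and sample size are chosen purely for illustration.

```r
# Inverse transform sampling for the Exponential(rate) distribution
rexp_inverse <- function(n, rate = 1) {
  u <- runif(n)            # step 1: simulate U ~ U(0, 1)
  -log(1 - u) / rate       # step 2: apply the quantile function F^{-1}(u)
}

set.seed(42)
x <- rexp_inverse(10000, rate = 2)

# Sanity checks: the sample mean should be near 1 / rate = 0.5,
# and a Kolmogorov-Smirnov test compares the sample with the target cdf
x_mean <- mean(x)
ks <- ks.test(x, "pexp", rate = 2)
```

The second step is equivalent to calling qexp(u, rate), R's built-in quantile function for the same distribution.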

Do not reinvent the wheel

  • As wannabe applied statisticians/data scientists, you will rarely need to implement your own sampling algorithm, as many efficient solutions are already out there
  • rnorm, rchisq, rpois, rbinom and many more, including samplers for much more complex distributions
  • However, it is always beneficial to have at least some understanding of how things work behind the scenes
  • Moreover, if you are considering a career in research, this knowledge becomes even more valuable
